 Intellectual Property & Technology Law


TOKENSWAP: A Lightweight Method to Disrupt Memorized Sequences in LLMs

Neural Information Processing Systems

As language models scale, their performance improves dramatically across a wide range of tasks, but so does their tendency to memorize and regurgitate parts of their training data verbatim. This tradeoff poses serious legal, ethical, and safety concerns, especially in real-world deployments. Existing mitigation techniques, such as differential privacy or model unlearning, often require retraining or access to internal weights, making them impractical for most users. In this work, we introduce TOKENSWAP, a lightweight, post-hoc defense designed for realistic settings where the user can only access token-level outputs. Our key insight is that while large models are necessary for high task performance, small models (e.g., DistilGPT-2) are often sufficient to assign fluent, grammatically plausible probabilities to common function words, and crucially, they memorize far less. By selectively swapping token probabilities between models, TOKENSWAP preserves the capabilities of large models while reducing their propensity for verbatim reproduction.
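The probability swap described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `swap_token_probs`, the toy four-token vocabulary, and the renormalization step are all assumptions made for illustration. The idea shown is only that, for a chosen set of token ids (standing in for common function words), the large model's probabilities are replaced by the small model's, and the result is rescaled to sum to one.

```python
def swap_token_probs(large_probs, small_probs, swap_ids):
    """Replace the large model's probabilities for swap_ids with the small
    model's, then renormalize. Hypothetical helper, not the paper's code."""
    if len(large_probs) != len(small_probs):
        raise ValueError("both distributions must cover the same vocabulary")
    mixed = list(large_probs)
    for token_id in swap_ids:
        mixed[token_id] = small_probs[token_id]
    total = sum(mixed)
    if total <= 0:
        raise ValueError("swapped distribution has no probability mass")
    # Renormalizing is an assumption here; it keeps the output a valid distribution.
    return [p / total for p in mixed]


# Toy 4-token vocabulary: ids 0 and 1 stand in for function words.
large = [0.05, 0.05, 0.10, 0.80]  # next-token distribution from the large model
small = [0.40, 0.30, 0.20, 0.10]  # next-token distribution from the small model
mixed = swap_token_probs(large, small, swap_ids={0, 1})
```

In this sketch the tokens outside `swap_ids` keep the large model's relative ordering, while the swapped tokens follow the small model, which is the intuition the abstract gives for disrupting memorized continuations.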


Appendix Table of Contents

Neural Information Processing Systems

Starting from Grobid's XML output, peS2o filters papers that are too short, have incorrect metadata, are in languages other than English, and contain OCR errors using a combination of heuristic- and model-based filtering steps. We refer the reader to the datasheet and code for more details on this processing pipeline. The subset of peS2o included in the Common Pile starts from v3 of the corpus, which contains documents from January 1, 1970 to October 6, 2024. We retain full-text papers with CC BY, CC BY-SA, or CC0 licenses, or that have been labeled as public domain; metadata is provided by the Semantic Scholar APIs [85]. After filtering, this set contains 6.3 million papers, or 35.7 billion whitespace-separated segments.
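As a rough illustration of the license filter and the segment count mentioned above, the following sketch keeps only records with an allowed license and counts whitespace-separated segments. The record layout (`license` and `text` keys), the license label strings, and the function names are hypothetical; the actual pipeline is described in the datasheet and code the passage refers to.

```python
# Hypothetical label strings for the licenses named in the passage.
ALLOWED_LICENSES = {"CC BY", "CC BY-SA", "CC0", "public domain"}


def is_openly_licensed(record):
    """Return True if the record's license label is in the allowed set."""
    return record.get("license") in ALLOWED_LICENSES


def count_segments(text):
    """Count whitespace-separated segments (runs of non-whitespace)."""
    return len(text.split())


# Toy records; the field names are assumptions for illustration only.
papers = [
    {"license": "CC BY", "text": "Open access  paper\ntext."},
    {"license": "CC BY-NC", "text": "Non-commercial paper."},
    {"license": "CC0", "text": "Dedicated to the public domain."},
]
kept = [p for p in papers if is_openly_licensed(p)]
total_segments = sum(count_segments(p["text"]) for p in kept)
```

Counting with `str.split()` and no arguments treats any run of spaces, tabs, or newlines as a single separator, which matches the usual reading of "whitespace-separated segments".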


The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Neural Information Processing Systems

Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.



PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination

Neural Information Processing Systems

Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires an extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims--prior art--in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with profound information, including rationales for the decisions provided in office action documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. PANORAMA also decomposes the trails into sequential benchmarks that emulate patent professionals' review processes and allow researchers to examine large language models' capabilities at each step. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination.


US judge dismisses Musk's xAI trade secret lawsuit against OpenAI

Al Jazeera

A United States federal judge has dismissed a lawsuit by Elon Musk's artificial intelligence company xAI that accused rival Sam Altman's OpenAI of stealing trade secrets for chatbots. US District Judge Rita Lin in San Francisco said on Monday that xAI failed to show that OpenAI induced former xAI senior engineer Xuechen Li to divulge confidential information related to its Grok chatbot, or that OpenAI engineers knew Li might have disclosed any. She dismissed an earlier version in February. The lawsuit originally filed last September focused on broader alleged misappropriation of confidential information, including source code, by xAI employees who left for jobs at OpenAI. Monday's decision is Musk's second legal loss against OpenAI in four weeks. On May 18, a federal jury ruled against Musk, the world's richest person, in his $150bn lawsuit accusing OpenAI and Altman of "stealing a charity" by betraying the company's original mission as a nonprofit to enrich themselves.


CNN is the latest media company to sue Perplexity

Engadget

The lawsuit, which was filed Thursday, claims that the AI company unlawfully crawls, scrapes, copies, and distributes CNN's content from CNN Digital Platforms and third-party platforms. It also accuses the AI tools of reproducing verbatim copies of its articles, including paywalled stories, in query responses to users. Perplexity's AI tools have also allegedly attributed hallucinated content to CNN, which the company says in the suit violates its trademark. "CNN's lawsuit stands for the proposition that Perplexity, a company valued at tens of billions of dollars, should not be able to steal from entities that create the original content Perplexity exploits," a CNN spokesperson said in a statement to the outlet. The public rely on high quality news journalism reported by human beings to understand their world, which is frequently dangerous and expensive to produce.



Taylor Swift files to trademark voice and image after AI concerns

BBC News

Taylor Swift has applied to trademark her voice and appearance in an apparent attempt to protect herself from artificial intelligence impersonations. The pop superstar has lodged three trademark applications in the US - one using a photo of herself on stage during her Eras Tour, and the other two being audio clips of her introducing herself while promoting her last album. AI-generated versions of Swift have cropped up in various ways in recent years - from explicit images to a fake election ad in which she appeared to urge people to vote for Donald Trump. The move comes after actor Matthew McConaughey became the first celebrity to use trademark rules to attempt to protect his voice and image from AI misuse earlier this year. Trademark applications are a relatively new way for celebrities to combat the growing issue of AI rip-offs.